[SPARK-23352][PYTHON][BRANCH-2.3] Explicitly specify supported types in Pandas UDFs #20588
Conversation
This PR explicitly specifies the supported types in Pandas UDFs. The main change is to add deduplicated, explicit type checking on `returnType`, together with documentation; in the process, it happened to fix several things.

1. Currently, we don't support `BinaryType` in Pandas UDFs. For example:

   ```python
   from pyspark.sql.functions import pandas_udf

   pudf = pandas_udf(lambda x: x, "binary")
   df = spark.createDataFrame([[bytearray(1)]])
   df.select(pudf("_1")).show()
   ```

   ```
   ...
   TypeError: Unsupported type in conversion to Arrow: BinaryType
   ```

   We can document this behaviour in the guide.

2. The grouped aggregate Pandas UDF fails fast on `ArrayType`, but it seems we can support this case:

   ```python
   from pyspark.sql.functions import pandas_udf, PandasUDFType

   foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
   df = spark.range(100).selectExpr("id", "array(id) as value")
   df.groupBy("id").agg(foo("value")).show()
   ```

   ```
   ...
   NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
   ```

3. Since we can check the return type ahead of time, we can fail fast before actual execution:

   ```python
   # we can fail fast at this stage because we know the schema ahead of time
   pandas_udf(lambda x: x, BinaryType())
   ```

Manually tested; unit tests for `BinaryType` and `ArrayType(...)` were added.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20531 from HyukjinKwon/pudf-cleanup.

(cherry picked from commit c338c8c)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
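To illustrate item 3, here is a minimal, library-free sketch of the "fail fast" pattern: validate the declared return type when the UDF is defined rather than at execution time. The type classes, function names, and the error message below are illustrative stand-ins, not PySpark's actual implementation.

```python
# Hypothetical sketch of fail-fast return-type validation.
# These classes stand in for pyspark.sql.types; they are not the real ones.
class BinaryType:
    def __repr__(self):
        return "BinaryType"

class DoubleType:
    def __repr__(self):
        return "DoubleType"

# Types that (in this sketch) Arrow conversion cannot handle.
UNSUPPORTED_RETURN_TYPES = (BinaryType,)

def pandas_udf_sketch(func, return_type):
    """Wrap `func`, but reject unsupported return types immediately,
    before any data is ever processed."""
    if isinstance(return_type, UNSUPPORTED_RETURN_TYPES):
        raise NotImplementedError(
            f"Invalid returnType with Pandas UDFs: "
            f"{return_type!r} is not supported")
    return func

# A supported type passes through; an unsupported one raises at definition time.
f = pandas_udf_sketch(lambda x: x, DoubleType())
```

The benefit is that a misdeclared UDF fails at the `pandas_udf(...)` call site, where the mistake was made, instead of deep inside Arrow serialization during a later action.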
cc @ueshin
LGTM.
Test build #87333 has finished for PR 20588 at commit.
```python
    def test_simple(self):
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        df = self.data
    def test_supported_types(self):
```
I'm starting to worry about the test coverage of vectorized UDFs and Arrow-based to/from pandas DataFrame conversion. Do we have any plan in PySpark to test all the data types?
Yup, agree. I was thinking of doing it. But if you (or your colleagues) are working on that or have a plan, no need to block it on me :). Please go ahead.
Maybe open a JIRA and ask the OSS community to do it?
Yup. Filed SPARK-23401.
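The coverage test discussed above could follow a simple pattern: iterate over (type, sample value) pairs and assert that an identity UDF round-trips each value unchanged. The sketch below uses plain-Python stand-ins for the types and the UDF; a real PySpark test would iterate over Spark's `DataType`s and run the UDF through Arrow conversion, so every name here is illustrative.

```python
# Hypothetical sketch of an exhaustive type-coverage test.
# (type name, sample value) pairs; stand-ins for Spark DataTypes.
SAMPLE_VALUES = [
    ("boolean", True),
    ("int", 42),
    ("double", 3.5),
    ("string", "spark"),
    ("array<double>", [1.0, 2.0, 3.0]),
]

def identity_udf(v):
    # Stand-in for pandas_udf(lambda x: x, <type>) applied to one value.
    return v

def roundtrip_failures(cases):
    """Return the type names whose sample value did not survive the UDF."""
    return [name for name, value in cases if identity_udf(value) != value]
```

Collecting all failures before asserting, rather than stopping at the first, makes it easy to see at a glance which types in the matrix are broken.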
@HyukjinKwon Could you update the PR description? This will be part of the commit, so it would be nice to document the exact changes made in this PR.
Yup, will update soon.
This PR contains multiple fixes, which is not ideal, especially for changes targeting 2.3.0. We should split it into multiple independent PRs if possible. cc @ueshin

Thanks! Merged to 2.3.
## What changes were proposed in this pull request?

This PR backports #20531: it explicitly specifies the supported types in Pandas UDFs. The main change is to add deduplicated, explicit type checking on `returnType`, together with documentation; in the process, it happened to fix several things.

1. Currently, we don't support `BinaryType` in Pandas UDFs. For example:

   ```python
   from pyspark.sql.functions import pandas_udf

   pudf = pandas_udf(lambda x: x, "binary")
   df = spark.createDataFrame([[bytearray(1)]])
   df.select(pudf("_1")).show()
   ```

   ```
   ...
   TypeError: Unsupported type in conversion to Arrow: BinaryType
   ```

   We can document this behaviour in the guide.

2. Since we can check the return type ahead of time, we can fail fast before actual execution:

   ```python
   # we can fail fast at this stage because we know the schema ahead of time
   pandas_udf(lambda x: x, BinaryType())
   ```

## How was this patch tested?

Manually tested; unit tests for `BinaryType` and `ArrayType(...)` were added.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #20588 from HyukjinKwon/PR_TOOL_PICK_PR_20531_BRANCH-2.3.
Could you close it?
To be more clear: this PR contained one targeted change that fixes multiple problems.